61 research outputs found

    Robust distance correlation for variable screening

    High-dimensional data are common in modern statistical applications, and variable selection methods play an indispensable role in identifying the critical features for scientific discovery. Traditional best subset selection methods are computationally intractable with a large number of features, while regularization methods such as Lasso, SCAD and their variants perform poorly on ultrahigh-dimensional data because of low computational efficiency and algorithmic instability. Sure screening methods have become popular alternatives: they first rapidly reduce the dimension using simple measures such as marginal correlation and then apply a regularization method. A number of screening methods have been developed for different models and problems; however, none of them targets data with heavy tails, another important characteristic of modern big data. In this paper, we propose a robust distance correlation ("RDC") based sure screening method for ultrahigh-dimensional regression with heavy-tailed data. The proposed method shares the good properties of the original model-free distance correlation based screening, with the additional merit of robustly estimating the distance correlation when the data are heavy-tailed, which improves model selection performance in screening. We conducted extensive simulations under different heavy-tail scenarios to demonstrate that the proposed procedure improves feature selection and prediction performance compared with existing model-based and model-free screening procedures. We also applied the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, where RDC outperformed the other methods in prioritizing the most essential and biologically meaningful genes.
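The screening step this abstract builds on can be sketched as follows. This is a minimal illustration of the classical (non-robust) sample distance correlation used to rank features one at a time; it is not the robust RDC estimator the paper proposes, whose details the abstract does not give. The function names (`centered_distances`, `distance_correlation`, `screen`) and the toy data are assumptions made for this example.

```python
import math
import random


def centered_distances(values):
    """Double-centered matrix of pairwise absolute distances for a 1-D sample."""
    n = len(values)
    d = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def mean_product(a, b):
    """Average of the elementwise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Classical sample distance correlation between two 1-D samples."""
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(mean_product(a, b), 0.0) / denom)


def screen(columns, y, keep):
    """Rank features by marginal distance correlation with y; keep the top ones."""
    scores = [distance_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Toy data (hypothetical): one feature drives y nonlinearly, the rest are noise.
rng = random.Random(0)
signal = [i / 10.0 for i in range(-60, 61)]
response = [v * v for v in signal]
noise = [[rng.gauss(0.0, 1.0) for _ in signal] for _ in range(4)]
top, all_scores = screen(noise[:2] + [signal] + noise[2:], response, keep=1)
print("selected feature index:", top)
```

A linear correlation would miss the quadratic relationship in the toy data, which is why model-free screening uses a dependence measure of this kind. Heavy-tailed data can distort the pairwise distances, and that is the gap the abstract says RDC addresses.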

    Bayesian indicator variable selection of multivariate response with heterogeneous sparsity for multi-trait fine mapping

    Variable selection has played a critical role in contemporary statistics and scientific discovery. Numerous regularization and Bayesian variable selection methods have been developed in the past two decades, but they mainly target a single response. As more data are collected, it is common to obtain and analyze multiple correlated responses from the same study. Running a separate regression for each response ignores their correlation, so multivariate analysis is recommended. Existing multivariate methods select variables related to all responses without considering the possible heterogeneous sparsity of different responses, i.e. some features may predict only a subset of the responses. In this paper, we develop a novel Bayesian indicator variable selection method for multivariate regression models with a large number of grouped predictors, targeting multiple correlated responses with possibly heterogeneous sparsity patterns. The method is motivated by the multi-trait fine mapping problem in genetics, which seeks to identify the variants that are causal to multiple related traits. Our method performs selection at the individual level, at the group level, and specifically for each response. In addition, we propose a new concept of subset posterior inclusion probability for inference, to prioritize predictors that target subset(s) of responses. Extensive simulations with varying sparsity levels, heterogeneity levels and dimensions show the advantage of our method in variable selection and prediction performance compared with existing general Bayesian multivariate variable selection methods and Bayesian fine mapping methods. We also applied our method to a real data example in imaging genetics and identified important causal variants for brain white matter structural change in different regions. Comment: 29 pages, 3 figures

    MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis

    Aspect-based sentiment analysis (ABSA) is a long-standing research interest in the field of opinion mining, and in recent years researchers have gradually shifted their focus from simple ABSA subtasks to end-to-end multi-element ABSA tasks. However, the datasets currently used in this research are limited to individual elements of specific tasks: they usually focus on in-domain settings, ignore implicit aspects and opinions, and are small in scale. To address these issues, we propose a large-scale Multi-Element Multi-Domain dataset (MEMD) that covers the four elements across five domains, including nearly 20,000 review sentences and 30,000 quadruples annotated with explicit and implicit aspects and opinions for ABSA research. We also evaluate generative and non-generative baselines on multiple ABSA subtasks under the open-domain setting, and the results show that open-domain ABSA and the mining of implicit aspects and opinions remain open challenges. The datasets are publicly released at https://github.com/NUSTM/MEMD-ABSA

    Evaluation of Changes in the Characteristic Flavor of Ultra-high Temperature Sterilized Milk under the Effects of Temperature and Light

    To study changes in the characteristic flavor of ultra-high temperature sterilized (UHT) milk under the influence of storage temperature and light, headspace solid phase microextraction (SPME) combined with gas chromatography-mass spectrometry (GC-MS) was used to detect the volatile flavor components of the product. Descriptive sensory evaluation, orthogonal partial least squares-discriminant analysis (OPLS-DA) and the entropy weight method were used to determine the relationship between major characteristic flavors and characteristic substances. The effects of temperature and light flux on the flavor changes of different formulations of UHT milk were analyzed, and a model for comprehensive analysis of the characteristic flavors of UHT milk was developed based on the effects of initial unsaturated fatty acid content, temperature and light flux. The results of this research provide support for the quality control of different formulations of UHT milk.

    Psychometric assessment of HIV/STI sexual risk scale among MSM: A Rasch model approach

    Background: Little research has assessed the degree of severity and ordering of different types of sexual behaviors for HIV/STI infection in a measurement scale. The purpose of this study was to apply the Rasch model to the psychometric assessment of an HIV/STI sexual risk scale among men who have sex with men (MSM). Methods: A cross-sectional study using respondent-driven sampling was conducted among 351 MSM in Shenzhen, China. The Rasch model was used to examine the psychometric properties of an HIV/STI sexual risk scale including nine types of sexual behaviors. Results: The Rasch analysis of the nine items met the unidimensionality and local independence assumptions. Although the person reliability was low at 0.35, the item reliability was high at 0.99. The fit statistics provided acceptable infit and outfit values. Item difficulty invariance analysis showed that the item estimates of the risk behavior items were invariant (within error). Conclusions: The findings suggest that the Rasch model can be used to measure the level of sexual risk for HIV/STI infection as a single latent construct and to establish the relative degree of severity of each type of sexual behavior in HIV/STI transmission and acquisition among MSM. The measurement scale provides a useful tool to inform, design and evaluate behavioral interventions for HIV/STI infection among MSM.
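For readers unfamiliar with the Rasch model, the sketch below shows its standard dichotomous form, in which the probability of endorsing an item depends only on the difference between a person parameter and an item parameter. The abstract does not say which Rasch variant or software the study used, so the dichotomous form, the function name and the three item severities are assumptions made for this example, not values from the study.

```python
import math


def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: P(endorse) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Hypothetical item severities on the logit scale (illustration only).
item_severity = {"item_A": -1.0, "item_B": 0.0, "item_C": 1.5}
person_risk = 0.5  # hypothetical latent risk level of one respondent

for name, b in item_severity.items():
    print(name, round(rasch_probability(person_risk, b), 3))
```

Because every item shares the same functional form, ordering the items by their estimated parameter gives the single severity ordering that the abstract describes.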

    Assessing trend and variation of Arctic sea-ice extent during 1979-2012 from a latitude perspective of ice edge

    Arctic sea-ice extent (in summer) has been shrinking since the 1970s. However, we have little knowledge of the detailed spatial variability of this shrinking. In this study, we examine the (latitudinal) ice extent along each degree of longitude, using the monthly Arctic ice index data sets (1979–2012) from the National Snow and Ice Data Center. Statistical analysis suggests that: (1) for summer months (July–October), there was a 34-year declining trend in sea-ice extent at most regions, except for the Canadian Arctic Archipelago, Greenland and Svalbard, with retreat rates of 0.0562–0.0898 latitude degree/year (or 6.26–10.00 km/year, at a significance level of 0.05); (2) for sea ice not geographically muted by the continental coastline in winter months (January–April), there was a declining trend of 0.0216–0.0559 latitude degree/year (2.40–6.22 km/year, at a significance level of 0.05). Regionally, the most evident sea-ice decline occurred in the Chukchi Sea from August to October, Baffin Bay and Greenland Sea from January to May, Barents Sea in most months, Kara Sea from July to August and Laptev Sea and eastern Siberian Sea in August and September. Trend analysis also indicates that: (1) the decline in summer ice extent became significant (at a 0.05 significance level) since 1999 and (2) winter ice extent showed a clear changing point (decline) around 2000, becoming statistically significant around 2005. The Pacific–Siberian sector of the Arctic accounted for most of the summer sea-ice decline, while the winter recovery of sea ice in the Atlantic sector tended to decrease. Keywords: NSIDC ice index; Arctic; sea-ice extent; ice-edge latitude. (Published: 11 September 2014) Citation: Polar Research 2014, 33, 21249, http://dx.doi.org/10.3402/polar.v33.2124
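The paired rates in this abstract (latitude degrees per year and km per year) can be checked with a short calculation. The sketch assumes about 111.32 km per degree of latitude, a factor the abstract does not state but which reproduces its km figures. The least-squares slope function only illustrates how a per-year trend is obtained from a yearly series; the abstract does not describe the study's statistical procedure. All names are hypothetical.

```python
KM_PER_DEGREE_LATITUDE = 111.32  # assumed conversion factor, not given in the abstract


def degrees_to_km_per_year(rate_deg_per_year):
    """Convert an ice-edge retreat rate from latitude degrees/year to km/year."""
    return rate_deg_per_year * KM_PER_DEGREE_LATITUDE


def ols_slope(years, values):
    """Ordinary least-squares slope of values against years (trend per year)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


# Rates quoted in the abstract, converted to km/year and rounded to 2 decimals.
for rate in (0.0562, 0.0898, 0.0216, 0.0559):
    print(rate, "deg/yr ->", round(degrees_to_km_per_year(rate), 2), "km/yr")
```

With this factor, the four rates convert to 6.26, 10.0, 2.4 and 6.22 km/year, matching the ranges given in the abstract.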